Advances in Large Language Models for Medicine

Kan, Zhiyu, Gan, Wensheng, Qi, Zhenlian, Yu, Philip S.

arXiv.org Artificial Intelligence

Artificial intelligence (AI) technology has advanced rapidly in recent years, with large language models (LLMs) emerging as a significant breakthrough. LLMs are increasingly making an impact across various industries, with the medical field standing out as the most prominent application area. This paper systematically reviews the up-to-date research progress of LLMs in the medical field, providing an in-depth analysis of training techniques for large medical models, their adaptation in healthcare settings, related applications, as well as their strengths and limitations. Furthermore, it innovatively categorizes medical LLMs into three distinct types based on their training methodologies and classifies their evaluation approaches into two categories. Finally, the study proposes solutions to existing challenges and outlines future research directions based on identified issues in the field of medical LLMs. By systematically reviewing previous and advanced research findings, we aim to highlight the necessity of developing medical LLMs, provide a deeper understanding of their current state of development, and offer clear guidance for subsequent research.


PersianMedQA: Evaluating Large Language Models on a Persian-English Bilingual Medical Question Answering Benchmark

Kalahroodi, Mohammad Javad Ranjbar, Sheikholselami, Amirhossein, Karimi, Sepehr, Kalahroodi, Sepideh Ranjbar, Faili, Heshaam, Shakery, Azadeh

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have achieved remarkable performance on a wide range of Natural Language Processing (NLP) benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale dataset of 20,785 expert-validated multiple-choice Persian medical questions from 14 years of Iranian national medical exams, spanning 23 medical specialties and designed to evaluate LLMs in both Persian and English. We benchmark 40 state-of-the-art models, including general-purpose, Persian fine-tuned, and medical LLMs, in zero-shot and chain-of-thought (CoT) settings. Our results show that closed-source general models (e.g., GPT-4.1) consistently outperform all other categories, achieving 83.09% accuracy in Persian and 80.7% in English, while Persian fine-tuned models such as Dorna underperform significantly (e.g., 34.9% in Persian), often struggling with both instruction-following and domain reasoning. We also analyze the impact of translation, showing that while English performance is generally higher, 3-10% of questions can only be answered correctly in Persian due to cultural and clinical contextual cues that are lost in translation. Finally, we demonstrate that model size alone is insufficient for robust performance without strong domain or language adaptation. PersianMedQA provides a foundation for evaluating bilingual and culturally grounded medical reasoning in LLMs. The PersianMedQA dataset is available at: https://huggingface.co/datasets/MohammadJRanjbar/PersianMedQA


Transfer or Self-Supervised? Bridging the Performance Gap in Medical Imaging

Zhao, Zehui, Alzubaidi, Laith, Zhang, Jinglan, Duan, Ye, Naseem, Usman, Gu, Yuantong

arXiv.org Artificial Intelligence

Table 1: Four main issues that have constrained the application of pre-training methods in the medical field: 1. the performance gap between TL and SSL across data modalities; 2. the domain mismatch between source and target domains; 3. the challenge of data-imbalance scenarios; 4. the difficulty of model explainability and analysis.

- Transfer Learning: A lightweight model trained directly on the target dataset can outperform a TL model pre-trained on natural images [30].
- Self-Supervised Learning: An SSL model pre-trained on natural images performs poorly on target COVID-19 samples and needs further guidance from the user.
  Problem summary: domain discrepancy during pre-training degrades the pre-trained model's performance [31].
- Transfer Learning: Using a pre-trained TL model does not bring significant improvement on a target medical dataset with an imbalanced sample distribution [32].
- Self-Supervised Learning: Imbalanced source and target datasets lead to poor model performance even after self-supervised pre-training.
  Problem summary: neither TL nor SSL shows improved performance on imbalanced datasets [33].
- Transfer Learning: The complexity of the model and the pre-training process makes the results hard to interpret and reduces the reliability of predictions [34].
- Self-Supervised Learning: SSL pre-training is fully unsupervised, raising the concern that the model may not have truly understood the target dataset and could be making predictions based on random factors.
  Problem summary: the complexity of the knowledge-transfer process raises concerns about model reliability.


The future of Apple Vision Pro is in medicine

Popular Science

Apple's $3,500 Vision Pro sounds like a bargain compared to the price of a fresh, medical-grade cadaver. And some medical institutions have started practicing surgery using the spatial-computing headset, which doesn't require a physical human body. Replacing cadavers is just one example of how the Vision Pro has made its way into the medical field since it hit the market in February 2024. On January 30-31, 2025, Sharp Healthcare hosted the inaugural Spatial Computing Health Care Summit, where medical providers gathered to discuss their use of spatial computing, which embeds digital objects into a live feed of the real world. The same tech that allows people to play virtual Battleship with each other has moved into applications that include everything from training and education to full-fledged operations on human patients.


Accurate Medical Named Entity Recognition Through Specialized NLP Models

Hu, Jiacheng, Bao, Runyuan, Lin, Yang, Zhang, Hanchao, Xiang, Yanlin

arXiv.org Artificial Intelligence

This study evaluated BioBERT in medical text processing for the task of medical named entity recognition. Comparative experiments with models such as BERT, ClinicalBERT, SciBERT, and BlueBERT showed that BioBERT achieved the best performance in both precision and F1 score, verifying its applicability and superiority in the medical field. BioBERT's pre-training on biomedical data enhances its ability to understand professional terminology and complex medical texts, providing a powerful tool for medical information extraction and clinical decision support. The study also explored the privacy and compliance challenges BioBERT faces when processing medical data, and proposed future research directions for combining it with other medical-specific models to improve generalization and robustness. With the development of deep learning technology, BioBERT's potential in application fields such as intelligent medicine, personalized treatment, and disease prediction will expand further. Future research can focus on the model's real-time performance and interpretability to promote its widespread application in the medical field.
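Comparisons like the one above are typically made with entity-level precision, recall, and F1, where a prediction counts only if both the span and the label match the gold annotation exactly. A minimal sketch of that scoring, on hypothetical toy spans (the `(start, end, label)` format and example entities are assumptions, not from the paper):

```python
# Entity-level precision/recall/F1 for NER, as used to compare models
# such as BioBERT and ClinicalBERT. Entities are (start, end, label)
# spans; a true positive requires an exact span-and-label match.

def ner_scores(gold, pred):
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # exact span + label matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: one entity matched, one missed, one spurious.
gold = [(0, 9, "DISEASE"), (15, 24, "DRUG")]
pred = [(0, 9, "DISEASE"), (30, 35, "DRUG")]
p, r, f1 = ner_scores(gold, pred)
print(p, r, f1)  # 0.5 0.5 0.5
```

Exact-match scoring is the strictest convention; relaxed variants that credit partial span overlap are also common in biomedical NER evaluations.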


Towards Clinical AI Fairness: Filling Gaps in the Puzzle

Liu, Mingxuan, Ning, Yilin, Teixayavong, Salinelat, Liu, Xiaoxuan, Mertens, Mayli, Shang, Yuqing, Li, Xin, Miao, Di, Xu, Jie, Ting, Daniel Shu Wei, Cheng, Lionel Tim-Ee, Ong, Jasmine Chiat Ling, Teo, Zhen Ling, Tan, Ting Fang, RaviChandran, Narrendar, Wang, Fei, Celi, Leo Anthony, Ong, Marcus Eng Hock, Liu, Nan

arXiv.org Artificial Intelligence

The ethical integration of Artificial Intelligence (AI) in healthcare necessitates addressing fairness--a concept that is highly context-specific across medical fields. Extensive studies have been conducted to expand the technical components of AI fairness, while tremendous calls for AI fairness have been raised from the healthcare community. Despite this, a significant disconnect persists between technical advancements and their practical clinical applications, resulting in a lack of contextualized discussion of AI fairness in clinical settings. Through a detailed evidence gap analysis, our review systematically pinpoints several deficiencies concerning both healthcare data and the provided AI fairness solutions. We highlight the scarcity of research on AI fairness in many medical domains where AI technology is increasingly utilized. Additionally, our analysis highlights a substantial reliance on group fairness, which aims to ensure equality among demographic groups from a macro healthcare-system perspective; in contrast, individual fairness, focusing on equity at a more granular level, is frequently overlooked. To bridge these gaps, our review advances actionable strategies for both the healthcare and AI research communities. Beyond applying existing AI fairness methods in healthcare, we further emphasize the importance of involving healthcare professionals to refine AI fairness concepts and methods to ensure contextually relevant and ethically sound AI applications in healthcare.


Towards Training A Chinese Large Language Model for Anesthesiology

Wang, Zhonghai, Jiang, Jie, Zhan, Yibing, Zhou, Bohao, Li, Yanhong, Zhang, Chong, Ding, Liang, Jin, Hua, Peng, Jun, Lin, Xu, Liu, Weifeng

arXiv.org Artificial Intelligence

Medical large language models (LLMs) have gained popularity recently due to their significant practical utility. However, most existing research focuses on general medicine, and there is a need for in-depth study of LLMs in specific fields like anesthesiology. To fill the gap, we introduce Hypnos, a Chinese anesthesiology model built upon existing LLMs, e.g., Llama. Hypnos' contributions have three aspects: 1) Data acquired from current LLMs, e.g., via Self-Instruct, likely includes inaccuracies. Hypnos implements a cross-filtering strategy to improve data quality: one LLM assesses the quality of the data generated by another LLM, and low-quality data are filtered out. 2) Hypnos employs a general-to-specific training strategy that starts by fine-tuning LLMs on general medicine data and subsequently improves the fine-tuned LLMs using data specifically from anesthesiology. The general medical data supplements the medical expertise in anesthesiology and enhances the effectiveness of Hypnos' generation. 3) We introduce a standardized benchmark for evaluating medical LLMs in anesthesiology. Our benchmark includes both publicly available instances from the Internet and privately obtained cases from hospitals. Hypnos outperforms other medical LLMs in anesthesiology on the benchmark dataset in terms of automatic metrics, GPT-4 evaluation, and human evaluation.
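The cross-filtering strategy in contribution 1) can be sketched as a simple judge-and-filter loop. The sketch below is illustrative only: `judge_quality` is a hypothetical stand-in for a real LLM-as-judge call, and the threshold, data format, and example items are assumptions, not details from the paper:

```python
# Sketch of cross-filtering: one model scores instruction data generated
# by another model, and items scoring below a threshold are dropped.

def judge_quality(item):
    # Placeholder heuristic standing in for an LLM quality rating in [0, 1].
    # A real system would prompt a second LLM to grade the QA pair; here we
    # simply penalize degenerate, very short answers.
    return 0.0 if len(item["answer"]) < 10 else 0.9

def cross_filter(generated_data, threshold=0.5):
    """Keep only items the judging model rates at or above `threshold`."""
    return [item for item in generated_data if judge_quality(item) >= threshold]

data = [
    {"question": "What monitors are required during general anesthesia?",
     "answer": "Standard monitoring includes ECG, blood pressure, and SpO2."},
    {"question": "Define MAC.", "answer": "MAC."},  # degenerate generation
]
kept = cross_filter(data)
print(len(kept))  # 1
```

The key design point is that the generator and the judge are different models, so systematic errors of one are less likely to survive the other's review.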


Advancements in eHealth Data Analytics through Natural Language Processing and Deep Learning

Apostol, Elena-Simona, Truică, Ciprian-Octavian

arXiv.org Artificial Intelligence

The healthcare environment is commonly referred to as "information-rich" yet "knowledge-poor". Healthcare systems collect huge amounts of data from various sources: lab reports, medical letters, logs of medical tools or programs, medical prescriptions, etc. These massive sets of data can provide great knowledge and information that can improve medical services and, more broadly, the healthcare domain, for example through disease prediction by analyzing a patient's symptoms, or disease prevention by facilitating the discovery of behavioral risk factors. Unfortunately, only a relatively small volume of textual eHealth data is processed and interpreted, an important factor being the difficulty of efficiently performing Big Data operations. In the medical field, detecting domain-specific multi-word terms is a crucial task, as they can define an entire concept with a few words. A term can be defined as a linguistic structure or a concept, and it is composed of one or more words with a specific meaning to a domain. All the terms of a domain constitute its terminology. This chapter offers a critical study of the current, most performant solutions for analyzing unstructured (image and textual) eHealth data. This study also provides a comparison of current Natural Language Processing and Deep Learning techniques in the eHealth context. Finally, we examine and discuss some of the current issues, and we define a set of research directions in this area.
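To make the multi-word-term idea concrete, here is a deliberately simple frequency-based extractor for two-word term candidates. This is an illustrative sketch, not the chapter's method; the stopword list, threshold, and example notes are all assumptions:

```python
# Naive multi-word term candidate extraction: collect bigrams of
# non-stopword tokens and keep those that recur across documents.
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "is", "was", "with", "in"}

def bigram_terms(texts, min_count=2):
    counts = Counter()
    for text in texts:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower())
                  if t not in STOPWORDS]
        counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return [" ".join(bg) for bg, c in counts.items() if c >= min_count]

notes = [
    "patient with high blood pressure and chest pain",
    "blood pressure stable, chest pain resolved",
]
print(bigram_terms(notes))  # ['blood pressure', 'chest pain']
```

Real terminology-extraction systems add linguistic filters (part-of-speech patterns) and statistical association measures, but the candidate-generation step follows this shape.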


Question-Answering Model for Schizophrenia Symptoms and Their Impact on Daily Life using Mental Health Forums Data

Internò, Christian, Ambrosini, Eloisa

arXiv.org Artificial Intelligence

In recent years, there has been a strong emphasis on mining medical data using machine learning techniques. A common problem is obtaining a noiseless set of textual documents with content relevant to the research question, and developing a Question Answering (QA) model for a specific medical field. The purpose of this paper is to present a new methodology for building a medical dataset and obtaining a QA model for the analysis of symptoms and their impact on daily life for a specific disease domain. The ``Mental Health'' forum was used, a forum dedicated to people suffering from schizophrenia and other mental disorders. Relevant posts of active users who regularly participate were extracted, providing a new method of obtaining low-bias content without privacy issues. Furthermore, it is shown how to pre-process the dataset to convert it into a QA dataset. The Bidirectional Encoder Representations from Transformers (BERT), DistilBERT, RoBERTa, and BioBERT models were fine-tuned and evaluated via F1-Score, Exact Match, Precision, and Recall. Empirical experiments demonstrated the effectiveness of the proposed method for obtaining an accurate dataset for QA model implementation. By fine-tuning the BioBERT QA model, we achieved an F1 score of 0.885, showing a considerable improvement and outperforming the state-of-the-art model for the mental disorders domain.
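The Exact Match and F1 metrics mentioned above are standard for extractive QA: Exact Match requires the predicted answer string to equal the gold answer, while F1 measures token overlap between the two. A minimal sketch of both (the example strings are hypothetical, and real evaluation scripts add extra normalization such as article and punctuation stripping):

```python
# Exact Match and token-overlap F1, the usual metrics for extractive QA
# (SQuAD-style) used to evaluate fine-tuned models such as BioBERT.
from collections import Counter

def exact_match(pred, gold):
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("auditory hallucinations", "auditory hallucinations"))  # 1.0
# pred has one extra token: precision 2/3, recall 1.0 -> F1 = 0.8
print(round(token_f1("frequent auditory hallucinations",
                     "auditory hallucinations"), 2))  # 0.8
```

Scores are usually averaged over all questions, taking the maximum against each question's set of acceptable gold answers.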


Nanorobotics in Medicine: A Systematic Review of Advances, Challenges, and Future Prospects

Rajendran, Shishir, Sundararajan, Prathic, Awasthi, Ashi, Rajendran, Suraj

arXiv.org Artificial Intelligence

Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, New York, NY, USA

Abstract

Nanorobotics offers an emerging frontier in biomedicine, holding the potential to revolutionize diagnostic and therapeutic applications through its unique capabilities in manipulating biological systems at the nanoscale. Following PRISMA guidelines, a comprehensive literature search was conducted using the IEEE Xplore and PubMed databases, resulting in the identification and analysis of a total of 414 papers. The studies were filtered to include only those that addressed both nanorobotics and direct medical applications. Our analysis traces the technology's evolution, highlighting its growing prominence in medicine as evidenced by the increasing number of publications over time. Applications ranged from targeted drug delivery and single-cell manipulation to minimally invasive surgery and biosensing. Despite the promise, limitations such as biocompatibility, precise control, and ethical concerns were also identified. This review aims to offer a thorough overview of the state of nanorobotics in medicine, drawing attention to current challenges and opportunities, and providing directions for future research in this rapidly advancing field.

Introduction

Nanorobotics, a field merging nanotechnology with teleoperated and autonomous robotics, presents groundbreaking solutions that are unattainable with conventional robotics. A nanorobot, also known as a nanomachine, is a miniature mechanical or electromechanical device designed to perform specific tasks at the nanoscale [1]. In contrast to nanorobots, nanoparticles are tiny particles with unique properties, used for applications such as drug delivery. Nanorobotics involves designing molecular-scale robots for tasks such as targeted medical procedures.